- Simple linear regression
- Multiple linear regression
- Qualitative predictors (dummy variables)
- Extensions: interactions, nonlinear effects
- Potential issues: outliers and assumption verification (residual plots and transformations)
9/14/2020
We can write down \(f(\mathbf{X})\) in the form of an equation: \[y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon\]
We interpret \(\beta_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed. But predictors usually change together!
Assumption: \(\varepsilon_1, \dots, \varepsilon_n \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2)\)
Intervals and testing via \(\hat{\beta}_j\) and \(\text{SE}(\hat{\beta}_j)\) are one-at-a-time procedures.
To test whether all the slope coefficients are zero at once (\(H_0: \beta_1 = \dots = \beta_p = 0\)), we can use the F-statistic:
\[F = \dfrac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p,n-p-1}\]
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3651 -3.3037 -0.6222  3.1068 10.3991
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.05424    1.87387   6.433 4.74e-09 ***
## X1           0.46707    0.26217   1.782    0.078 .
## X2          -0.97619    0.09899  -9.861 2.68e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.756 on 97 degrees of freedom
## Multiple R-squared:  0.5136, Adjusted R-squared:  0.5035
## F-statistic:  51.2 on 2 and 97 DF,  p-value: 6.625e-16
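As a sketch of where such output comes from, the following fits a multiple regression on simulated data (the names Y, X1, X2 mirror the printed call, but the numbers here are not the ones above) and checks the F-statistic formula against the value reported by summary():

```r
# Sketch: verify the F-statistic formula on simulated data
set.seed(1)
n <- 100; p <- 2
X1 <- rnorm(n)
X2 <- rnorm(n, sd = 5)
Y  <- 12 + 0.5 * X1 - 1 * X2 + rnorm(n, sd = 5)

fit <- lm(Y ~ X1 + X2)
RSS <- sum(residuals(fit)^2)                  # residual sum of squares
TSS <- sum((Y - mean(Y))^2)                   # total sum of squares
F_by_hand <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
c(by_hand = F_by_hand, from_summary = unname(summary(fit)$fstatistic["value"]))
```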
Example: investigate the difference in credit card balance between males and females, ignoring the other variables.
We replace the variable gender with the following “dummy” variable: \[x_i = \begin{cases} 0 & \text{if $i^{th}$ person is male} \\ 1 & \text{if $i^{th}$ person is female} \end{cases}\]
Resulting model: \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i = \begin{cases} \beta_0 + \varepsilon_i & \text{if $i^{th}$ person is male} \\ (\beta_0 + \beta_1) + \varepsilon_i & \text{if $i^{th}$ person is female} \end{cases}\]
Interpretation? \(\beta_0\) is the average balance among males, \(\beta_0 + \beta_1\) is the average balance among females, and \(\beta_1\) is the average difference in balance between females and males.
We want to evaluate the difference in house prices in a few different neighborhoods.
##       Size       Neigh    Price
## 1 1.062684   Riverside 109.5524
## 2 2.461323   Hyde Park 170.7928
## 3 2.425506   Hyde Park 176.5540
## 4 2.464293   Riverside 208.7832
## 5 3.117517   Hyde Park 202.9831
## 6 2.510854 West Campus 175.3222
Let’s create the dummy variables “Neigh West Campus” and “Neigh Hyde Park”.
##   Intercept     Size Neigh West Campus Neigh Hyde Park
## 1         1 1.062684                 0               0
## 2         1 2.461323                 0               1
## 3         1 2.425506                 0               1
## 4         1 2.464293                 0               0
## 5         1 3.117517                 0               1
## 6         1 2.510854                 1               0
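In R this dummy coding happens automatically when a factor enters a model formula. A minimal sketch, assuming a data frame `houses` with the columns shown above (Size, Neigh, Price):

```r
# Make Riverside the baseline level, as in the notes, then inspect the design matrix
houses$Neigh <- relevel(factor(houses$Neigh), ref = "Riverside")
head(model.matrix(Price ~ Size + Neigh, data = houses))  # intercept, Size, two dummies

fit_full <- lm(Price ~ Size + Neigh, data = houses)      # reused in later sketches
summary(fit_full)
```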
With more than two levels, we create additional dummy variables. For example, for the neighborhood variable we create two dummy variables:
\[\begin{aligned} &x_{i1} = \begin{cases} 1 & \text{if $i^{th}$ house is in West Campus} \\ 0 & \text{if $i^{th}$ house is not in West Campus} \end{cases} \\ &x_{i2} = \begin{cases} 1 & \text{if $i^{th}$ house is in Hyde Park} \\ 0 & \text{if $i^{th}$ house is not in Hyde Park} \end{cases} \end{aligned}\]
Both of these variables can then be used in the regression equation, together with size, to obtain the model:
\[\begin{aligned} y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 \text{size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in Riverside} \\ (\beta_0 + \beta_1) + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in West Campus} \\ (\beta_0 + \beta_2) + \beta_3 \text{size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]
General rule: one fewer dummy variable than the number of levels. The level with no dummy variable - “Riverside” in this example - is known as the baseline.
How to test for a neighbourhood effect?
Test the null hypothesis \[H_0: \beta_1 = \beta_2 = 0\] with an F-test, as sketched below.
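A minimal sketch, assuming the `houses` data frame and `fit_full` from the earlier sketch, comparing the models with and without the neighbourhood dummies:

```r
# F-test of H0: no neighbourhood effect, via nested models
fit_reduced <- lm(Price ~ Size, data = houses)
anova(fit_reduced, fit_full)   # reports the F-statistic and its p-value
```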
Confidence interval for the intercept parameter of a house in Hyde Park? It is a linear combination of the parameters!
\[\beta_0 + \beta_2\]
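A hedged sketch of computing such an interval from `fit_full` above, using the estimated covariance matrix of the coefficients (the dummy's coefficient name depends on the factor coding, so check `names(coef(fit_full))`):

```r
# 95% CI for beta0 + beta2, the intercept of the Hyde Park line
b <- coef(fit_full)
V <- vcov(fit_full)
a <- as.numeric(names(b) %in% c("(Intercept)", "NeighHyde Park"))  # weight vector
est <- sum(a * b)
se  <- as.numeric(sqrt(t(a) %*% V %*% a))
est + c(-1, 1) * qt(0.975, df = df.residual(fit_full)) * se
```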
In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
For example, the linear model \[\text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \varepsilon\] states that the average effect on sales of a one-unit increase in TV is always \(\beta_1\), regardless of the amount spent on radio.
Say that \(\hat{\beta}_{1} = 0.4, \hat{\beta}_{2} = 0.2\). Given a fixed budget of \(\$100,000\), how would you allocate that budget to maximize sales? Under this additive model the answer is to spend everything on TV, since \(\hat{\beta}_{1} > \hat{\beta}_{2}\) regardless of how much has already been spent on either medium; this rigidity motivates adding an interaction term.
One way to relax the additive assumption is to include an interaction term, so that the model takes the form
\[\text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \beta_3 \text{TV} \times \text{Radio} + \varepsilon\]
If the p-value for the interaction term \(\text{TV} \times \text{Radio}\) is small, then there is strong evidence for \(\beta_3 \neq 0\).
Say that \(\hat{\beta}_0 = 6.750, \hat{\beta}_1 = 0.019, \hat{\beta}_2 = 0.029, \hat{\beta}_3 = 0.0011\)
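With these numbers the effect of TV is no longer constant: a one-unit increase in TV changes expected sales by \[\hat{\beta}_1 + \hat{\beta}_3 \times \text{Radio} = 0.019 + 0.0011 \times \text{Radio},\] so the estimated return on TV spending grows with the radio budget (and symmetrically for radio).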
Back to the Austin house data (simplified).
The previous version of the model was
\[\begin{aligned} y_i &= \beta_0 + \beta_1 \text{HP} + \beta_2 \text{Size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_2 \text{Size} + \varepsilon & \text{if $i^{th}$ house is not in Hyde Park} \\ (\beta_0 + \beta_1) + \beta_2 \text{Size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]
Adding an interaction between neighbourhood and size leads to
\[\begin{aligned} y_i &= \beta_0 + \beta_1 \text{HP} + \beta_2 \text{Size} + \beta_3 \text{HP} \times \text{Size} + \varepsilon \\ &\ \\ &= \begin{cases} \beta_0 + \beta_2 \text{Size} + \varepsilon & \text{if $i^{th}$ house is not in Hyde Park} \\ (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \text{Size} + \varepsilon & \text{if $i^{th}$ house is in Hyde Park} \end{cases} \end{aligned}\]
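A minimal sketch of fitting this interaction model, again assuming the `houses` data frame, with HP as a 0/1 indicator for Hyde Park:

```r
houses$HP <- as.numeric(houses$Neigh == "Hyde Park")
fit_int <- lm(Price ~ HP * Size, data = houses)   # expands to HP + Size + HP:Size
summary(fit_int)
```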
A simple approach for incorporating non-linear associations in a linear model is to include transformed versions of the predictors in the model, e.g.
\[\text{mpg} = \beta_0 + \beta_1 \text{horsepower} + \beta_2 \text{horsepower}^2 + \varepsilon\]
We are predicting mpg using a non-linear function of horsepower, but it is still a linear model with \(X_1 = \text{horsepower}\) and \(X_2 = \text{horsepower}^2\).
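For example, assuming the Auto data from the ISLR package, the quadratic model can be fit with an ordinary call to lm():

```r
library(ISLR)  # provides the Auto data (mpg, horsepower, ...)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)  # I() protects the square
summary(fit_quad)
```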
We can also use the linear model machinery to fit an exponential curve with multiplicative error \[Y = a b^{X} \varepsilon.\]
In fact, taking the log, we get \[\log (Y) = \underbrace{\log(a)}_{\beta_{0}} + \underbrace{\log(b)}_{\beta_{1}} X + \log(\varepsilon).\] So it corresponds to fitting the linear model to the transformed data \[\{(x_{1}, \log(y_{1})), \dots, (x_{n}, \log(y_{n}))\}.\]
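A small simulated sketch (with a = 2, b = 1.5 and multiplicative noise) showing that the exponential fit reduces to a linear fit on the log scale:

```r
set.seed(1)
x <- runif(100, 0, 5)
y <- 2 * 1.5^x * exp(rnorm(100, sd = 0.1))  # Y = a * b^x * eps, multiplicative error
fit_exp <- lm(log(y) ~ x)
exp(coef(fit_exp))   # back-transform: estimates of a and b
```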
An important assumption of the linear regression model is that the error terms, \(\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n\) are uncorrelated.
If the errors are uncorrelated, then the fact that \(\varepsilon_i\) is positive provides little or no information about the sign of \(\varepsilon_{i+1}\).
Another important assumption of the linear regression model is that the error term have a constant variance, \(\text{Var}(\varepsilon_i) = \sigma^2\).
Observations with unusual predictor values (high-leverage points) can be flagged with the leverage statistic \[h_i = \frac{1}{n} + \frac{(x_i - \overline{x})^2}{\sum_{j=1}^n (x_j - \overline{x})^2}\]
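In R, the leverage values of any fitted lm object (say `fit` from one of the sketches above) are available via hatvalues():

```r
h <- hatvalues(fit)      # leverage of each observation
which(h > 2 * mean(h))   # a common rule of thumb for flagging high-leverage points
```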
When building a regression model remember that simplicity is your friend. Smaller models are easier to interpret and have fewer unknown parameters to be estimated.
Keep in mind that every additional parameter represents a cost!
The first step of every model-building exercise is selecting the universe of variables that could potentially be used. This task relies entirely on your experience and context-specific knowledge.
With a universe of variables in hand, the goal now is to select the model. Why not simply include all of the variables?
Big models tend to over-fit, finding features that are specific to the data at hand, i.e. relationships that do not generalize.
The results are bad predictions and bad science!
In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn.
We need a strategy for building a model in a way that accounts for the trade-off between fitting the data and the uncertainty associated with the model: subset selection, shrinkage, or dimension reduction.
Monthly passengers in the US airline industry (in thousands of passengers) from 1949 to 1960. Goal: predict the number of passengers in the next couple of months.
How about a “trend model”? \(Y_t = \beta_0 + \beta_1 t + \varepsilon_t\)
Let’s look at the residuals: are there obvious patterns?
The variance of the residuals seems to grow over time. Let's instead try \(\log(Y_t) = \beta_0 + \beta_1 t + \varepsilon_t\)
Let’s look at the residuals: are there obvious patterns?
Seasonal patterns? Let's now add dummy variables for the months (11 dummies, with January as the baseline), i.e. \(\log(Y_t) = \beta_0 + \beta_1 t + \beta_2 \text{Feb} + \dots + \beta_{12} \text{Dec} + \varepsilon_t\)
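A sketch of this progression using R's built-in AirPassengers series (monthly totals, 1949 to 1960), which matches the data described above:

```r
y     <- as.numeric(AirPassengers)
trend <- seq_along(y)
month <- factor(as.numeric(cycle(AirPassengers)))   # 1 = Jan (baseline), ..., 12 = Dec

fit_trend    <- lm(y ~ trend)                 # trend model
fit_logtrend <- lm(log(y) ~ trend)            # log scale stabilizes the variance
fit_seasonal <- lm(log(y) ~ trend + month)    # adds the 11 monthly dummies

plot(residuals(fit_seasonal), type = "l")     # inspect the remaining patterns
```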
Still not perfect: the residuals do not look like iid normal noise!